Tidy Data
“Happy families are all alike; every unhappy family is unhappy in its
own way.” –– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its
own way.” –– Hadley Wickham
- Key ideas:
  - Cases = Rows
  - Variables = Columns
- How should we define case?
- How do we identify variables?
- Advantages and Disadvantages
Vocabulary
Variable
- In data science, the word variable has a different meaning than in
mathematics.
  - In algebra, a variable is an unknown quantity.
  - In data, a variable is known; it represents a feature that has been
    measured or observed. “Variable” refers to a specific quantity or
    quality that can vary from one case to another.
- Types of variables
  - quantitative: a number
  - categorical (R calls these factors): tells which category or group a
    case falls into
  - all non-numerical values are categorical, but not all numerical
    values are quantitative
    - e.g., zip codes, IP addresses, dates
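The difference between a numerical value and a quantitative one can be sketched in a few lines of Python (used here only for illustration; the column names and values are invented). Storing a zip code as a number drops its leading zero, and arithmetic on it has no meaning, so it is usually safer to keep it as text.

```python
from collections import Counter

# Hypothetical records: zip codes are made of digits, but they label a place.
records = [
    {"zip_code": "02139", "height_in": 66.5},
    {"zip_code": "16802", "height_in": 70.0},
]

# Treating the zip code as a number loses information (the leading zero).
as_number = int(records[0]["zip_code"])
print(as_number)  # 2139

# A quantitative variable supports arithmetic such as a mean.
mean_height = sum(r["height_in"] for r in records) / len(records)
print(mean_height)  # 68.25

# A categorical variable supports counting cases per category instead.
zip_counts = Counter(r["zip_code"] for r in records)
print(zip_counts["02139"])  # 1
```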
Cases
- Unit of observation or analysis
  - this is extremely context-specific
What is Tidy Data
- Being neat is not what makes data tidy!
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation/case must have its own row.
- Each value must have its own cell.
It is your job as the researcher to define the variables,
observations, and values.
- The “tidiness” of the data set depends on the research question. It
is not an inherent property of the data set itself.
- When data are in tidy form, it’s often straightforward to transform
the data into arrangements that are useful for answering interesting
questions.
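As a rough illustration of the three rules, the sketch below reshapes a small invented table from a "wide" layout, where the variable year is hidden in the column headers, into a tidy layout. The table, the column names, and the helper `to_tidy` are all hypothetical, and Python is used only as an example tool. The result counts as tidy only under the assumption that a case is one country in one year.

```python
# Untidy: the variable "year" is spread across the column headers.
untidy = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 20, "2000": 25},
]

def to_tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row (one row per case)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: int(column),  # assumes the headers are years
                value_name: value,
            })
    return tidy_rows

tidy = to_tidy(untidy, "country", "year", "count")
for case in tidy:
    print(case)
```

Each printed row now holds one case, each key is one variable, and each cell holds one value.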
Example of Untidy data

Example of Tidy Data

- Disadvantages
  - tidy data can be hard for humans to interpret quickly
  - often not the ideal form for creating graphics
- Advantages
  - clear definitions
  - tidy data can easily be wrangled into a useful form for
    interpretation and visualization
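The last advantage can be sketched as follows. Starting from a small tidy table (values invented for illustration), a per-group summary and a reader-friendly wide layout each take only a few lines. Python's standard library is an assumption here; any tool would do.

```python
from collections import defaultdict

# Tidy: one row per case (a country in a year), one column per variable.
tidy = [
    {"country": "A", "year": 1999, "count": 10},
    {"country": "A", "year": 2000, "count": 12},
    {"country": "B", "year": 1999, "count": 20},
    {"country": "B", "year": 2000, "count": 25},
]

# Summary for interpretation: total count per country.
totals = defaultdict(int)
for case in tidy:
    totals[case["country"]] += case["count"]
print(dict(totals))  # {'A': 22, 'B': 45}

# Wide layout for reading: one row per country, one column per year.
wide = defaultdict(dict)
for case in tidy:
    wide[case["country"]][case["year"]] = case["count"]
print(dict(wide))  # {'A': {1999: 10, 2000: 12}, 'B': {1999: 20, 2000: 25}}
```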
Galton Data
In the 1880s, Francis Galton began developing a mathematical theory of
evolution.
Here’s part of a page from his lab notebook. Discuss the following in
groups:
- What might he investigate with these data (i.e., what is the research
question)?
- Are these data tidy according to our
definition?
- What are the cases?
- What are the variables?
- How many rows of data should the result have?
- How many columns of data should the result have? What is the data
type of each column?
- What are some additional variables (not yet shown) that might be of
interest? How would you recommend showing that information in the data
table?
Activity 01: Tidy Data
Work to put these tables in tidy form
- Work with your partner
- As a team, you will put two different data sets into “tidy”
form.
- See Canvas for details
  - view-only source data is provided
  - use any software you like
  - must submit a CSV to Canvas
  - do not use spaces in your file names
- Tip: Sketch things out together on paper before you do
anything on the computer
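For the submission step, one possible way to write a tidy table to a CSV file is sketched below with Python's `csv` module. The file name, column names, and values are placeholders, not the activity's actual data. The file name uses underscores rather than spaces.

```python
import csv

# Placeholder rows; replace them with your own tidy table.
rows = [
    {"case_id": 1, "variable_a": "x", "variable_b": 3.5},
    {"case_id": 2, "variable_a": "y", "variable_b": 4.0},
]

file_name = "activity01_tidy_table.csv"  # hypothetical name, no spaces

with open(file_name, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()    # first line: the variable names
    writer.writerows(rows)  # one line per case
```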
Table 1: Galton’s height measurement data
Table 2: Presidents

Code Books
What is a code book?
A codebook describes the contents, structure,
and layout of a data collection.
A well-documented codebook contains information intended to be
complete and self-explanatory for each variable in a data file.
https://www.icpsr.umich.edu/web/ICPSR/cms/1983
Federal Election Commission
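A codebook can itself be kept as a small table with one row per variable. The sketch below is only an illustration: the variables and descriptions are invented, and the fields shown (name, type, units, description) are one reasonable choice rather than a required standard. The hypothetical helper `undocumented` checks that every column in a data file has a codebook entry.

```python
# Hypothetical codebook: one entry per variable in the data file.
codebook = [
    {"name": "case_id", "type": "categorical", "units": "",
     "description": "Unique identifier for each case"},
    {"name": "height_in", "type": "quantitative", "units": "inches",
     "description": "Measured height of the person"},
]

def undocumented(columns, codebook):
    """Return the data columns that have no codebook entry."""
    documented = {entry["name"] for entry in codebook}
    return [column for column in columns if column not in documented]

print(undocumented(["case_id", "height_in", "zip_code"], codebook))
# ['zip_code']
```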